Chapter 10 Structured Corpus

Many pre-collected corpora are available for linguistic studies. This chapter demonstrates how to load existing corpora in R and perform basic corpus analysis on these data.

10.1 NCCU Spoken Mandarin

CHILDES format

10.1.3 Metadata vs. Transcript

10.1.4 Word Tokenization

10.1.5 Word frequencies and Wordcloud

10.1.6 Concordances

10.1.7 N-grams (Lexical Bundles)

10.2 Connecting SPID to Metadata

Based on the metadata in each file header, we can extract demographic information about each speaker, including their ID, age, gender, etc.
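The extraction described above can be sketched as follows. This is a minimal illustration in base R, not the chapter's own code. It assumes the headers follow the CHAT (CHILDES) `@ID:` layout, whose pipe-separated fields are language, corpus, speaker code, age, sex, group, SES, role, education, and custom. The header lines, the corpus name `DemoCorpus`, the utterances, and the helper `parse_id_lines()` are all hypothetical.

```r
# Hypothetical CHAT-style header lines; all values are invented for illustration.
header <- c(
  "@Begin",
  "@Languages:\tzho",
  "@Participants:\tF1 Speaker1 Participant, M1 Speaker2 Participant",
  "@ID:\tzho|DemoCorpus|F1|25;|female|||Participant|||",
  "@ID:\tzho|DemoCorpus|M1|19;|male|||Participant|||"
)

# Column names for the ten pipe-separated fields of an @ID line.
id_fields <- c("language", "corpus", "spid", "age", "sex",
               "group", "ses", "role", "education", "custom")

# Hypothetical helper: turn the @ID lines of a header into a data frame.
parse_id_lines <- function(lines) {
  id_lines <- grep("^@ID:", lines, value = TRUE)   # keep only the @ID lines
  id_values <- sub("^@ID:\\s*", "", id_lines)      # drop the "@ID:<tab>" prefix
  parts <- strsplit(id_values, "|", fixed = TRUE)  # split on the literal pipe
  rows <- lapply(parts, function(p) {
    length(p) <- length(id_fields)                 # pad (or trim) to ten fields
    p[is.na(p)] <- ""                              # padded fields become empty strings
    p
  })
  meta <- as.data.frame(do.call(rbind, rows), stringsAsFactors = FALSE)
  names(meta) <- id_fields
  # CHAT ages look like "years;months.days"; keep the years part only.
  meta$age_years <- as.integer(sub(";.*$", "", meta$age))
  meta
}

meta <- parse_id_lines(header)

# Hypothetical utterance table keyed by speaker ID (SPID).
utterances <- data.frame(
  spid = c("F1", "M1", "F1"),
  utterance = c("utterance 1", "utterance 2", "utterance 3"),
  stringsAsFactors = FALSE
)

# Attach each speaker's demographic information to their utterances.
merged <- merge(utterances, meta[, c("spid", "age_years", "sex")], by = "spid")
print(merged)
```

Joining on the speaker ID keeps the transcript in a one-row-per-utterance table while making age and gender available as grouping variables for later analyses.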

10.3 More Sociolinguistic Analyses

10.3.1 Check Ngram Distribution By Age Groups

Below20 Word Cloud

Order ggplot barplots by factor frequencies

10.3.2 Check Word Distribution of Different Genders